<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Extending Indexing</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" type="text/css" charset="utf-8" media="all" href="docs.css">
</head>

<body>
<!--!bodystart-->
[<a href="terrier_develop.html">Previous: Developing with Terrier</a>] [<a href="index.html">Contents</a>] [<a href="extend_retrieval.html">Next: Extending Retrieval</a>]
<table width="100%">
  <tr> 
    <td width="82%" valign="bottom"><h1>Extending Indexing in Terrier</h1></td>
		<!--!bodyremove-->
    <td width="18%"><a href="http://ir.dcs.gla.ac.uk/terrier/"><img src="images/terrier-logo-web.jpg" border="0"></a></td>
		<!--!/bodyremove-->
  </tr>
</table>

<p align="justify">Unless your data is in files (1 file per document), you will probably need to
create your own collection decoder. This is done by implementing the 
Collection interface (<a href="javadoc/uk/ac/gla/terrier/indexing/Collection.html">uk.ac.gla.terrier.indexing.Collection</a>), and writing 
your own indexing application. (See the classes <a href="javadoc/uk/ac/gla/terrier/applications/TRECIndexing.html">uk.ac.gla.terrier.applications.TRECIndexing</a>, 
or <a href="javadoc/uk/ac/gla/terrier/applications/desktop/DesktopTerrier.html">uk.ac.gla.terrier.applications.desktop.DesktopTerrier</a>).</p>

<p align="justify">If your documents are of a non-standard format, then we would advise you 
create your own Document implementation as well. You'll need to implement 
the interface <a href="javadoc/uk/ac/gla/terrier/indexing/Document.html">uk.ac.gla.terrier.indexing.Document</a>. The purpose of a Document object is parse a Document, and identify terms, which should be returned in order of their occurrence. Optionally, you can designate the fields that the terms exist in, and if configured the indexer will note the fields that each term occur in within a document.</p>

<h2>Classical two-pass indexing</h2>

<p align="justify">Essentially, you can now use the BasicIndexer or the BlockIndexer 
  to index your collection. The BlockIndexer provides the same functionality as 
  BasicIndexer, but uses larger DirectIndex and InvertedIndex for storing the 
  positions that each word occurs at in each document. This allows querying to 
  use term positions information - for example Phrasal search (&quot;&quot;) and proximity 
  search (&quot;&quot;~10). For more details about the querying process, you may refer to 
  <a href="extend_retrieval.html">querying with Terrier</a> and the description of the 
  <a href="querylanguage.html">query language</a>.</p>

 <p align="justify">The indexer iterates through the documents of the collection and sends each term found
through the <a href="javadoc/uk/ac/gla/terrier/terms/TermPipeline.html">TermPipeline</a>. The TermPipeline
transforms the terms, and can remove terms that should not be indexed. The TermPipeline chain in use is
<tt>termpipelines=Stopwords,PorterStemmer</tt>, which removes terms from the document using the <a href="javadoc/uk/ac/gla/terrier/terms/Stopwords.html">Stopwords</a> object, and then applies Porter's Stemming algorithm for English to the terms (<a href="javadoc/uk/ac/gla/terrier/terms/PorterStemmer.html">PorterStemmer</a>). If you wanted to use a different stemmer, this is the point at which it should be implemented.
</p>

<p align="justify"> Once terms have been processed through the TermPipeline, they are aggregated by the <a href="javadoc/uk/ac/gla/terrier/structures/indexing/DocumentPostingList.html">DocumentPostingList</a> and the <a href="javadoc/uk/ac/gla/terrier/structures/indexing/LexiconMap.html">LexiconMap</a>, to create the following data structures:</p>
<ul>
<li><a href="javadoc/uk/ac/gla/terrier/structures/DirectIndex.html">DirectIndex</a> : a compressed file, where we store the terms contained in each document. The direct index is used for automatic query expansion. Built by the <a href="javadoc/uk/ac/gla/terrier/structures/indexing/DirectIndexBuilder.html">DirectIndexBuilder</a>s</li>

<li><a href="javadoc/uk/ac/gla/terrier/structures/DocumentIndex.html">DocumentIndex</a> : a fixed-length entry file, where we store information about documents, such as the number of indexed tokens (document length), 
   the identifier of a document, and the offset of its corresponding entry 
   in the direct index. Created by the <a href="javadoc/uk/ac/gla/terrier/structures/indexing/DocumentIndexBuilder.html">DocumentIndexBuilder</a></li>

<li><a href="javadoc/uk/ac/gla/terrier/structures/Lexicon.html">Lexicon</a> : a fixed-length entry file, where we store information about the 
   vocabulary of the indexed collection. Built as a series of temporary Lexicons by the <a href="javadoc/uk/ac/gla/terrier/structures/indexing/LexiconBuilder.html">LexiconBuilder</a>s, which
are then merged at the end of the DirectIndex build phase. </li>
</ul>

<p align="justify">As the indexer iterates through the documents of the collection, it appends 
the direct and document indexes. For saving the vocabulary information, 
the indexer creates temporary lexicons for parts of the collection, which 
are merged once all the documents have been processed.</p>

<p align="justify">Once the direct index, the document index and the lexicon have been created,
the inverted index is created, by the <a href="javadoc/uk/ac/gla/terrier/structures/indexing/InvertedIndexBuilder.html">InvertedIndexBuilder</a>, which inverts the direct index.</p>

<h3>Single-pass indexing</h3>
<p align="justify">Terrier 2.0 adds the single-pass indexing architecture. In this architecture, indexing is performed to build up in-memory posting lists (<a href="javadoc/uk/ac/gla/terrier/structures/indexing/singlepass/MemoryPostings.html">MemoryPostings</a> containing <a href="javadoc/uk/ac/gla/terrier/structures/indexing/singlepass/Posting.html">Posting</a> objects), which are written to disk as 'runs' by the <a href="javadoc/uk/ac/gla/terrier/structures/indexing/singlepass/RunWriter.html">RunWriter</a> when most of the available memory is consumed.</p>

<p align="justify">Once the collection has been parsed, all runs are merged by the <a href="javadoc/uk/ac/gla/terrier/structures/indexing/singlepass/RunsMerger.html">RunsMerger</a>, which uses a <a href="javadoc/uk/ac/gla/terrier/structures/indexing/singlepass/SimplePostingInRun.html">SimplePostingInRun</a> to represent each posting list when iterating through the contents of each run.</p>

<p align="justify">If a direct file is required, the <a href="javadoc/uk/ac/gla/terrier/structures/indexing/singlepass/Inverted2DirectIndexBuilder.html">Inverted2DirectIndexBuilder</a> can be used to create one.</p>

<h2>Changing Indexing</h2>
<p align="justify">To replace the default indexing structures in Terrier with others is very easy, as the data.properties file contains information about which classes should be used to load the four main data structures of the Index: <a href="javadoc/uk/ac/gla/terrier/structures/DocumentIndex.html">DocumentIndex</a>, <a href="javadoc/uk/ac/gla/terrier/structures/DirectIndex.html">DirectIndex</a>, <a href="javadoc/uk/ac/gla/terrier/structures/Lexicon.html">Lexicon</a> and <a href="javadoc/uk/ac/gla/terrier/structures/InvertedIndex.html">InvertedIndex</a>. To implement a replacement index data structure, it may sufficient to subclass a builder, and then subclass the appropriate Indexer class to ensure is used.</p>

<p align="justify">Adding other data structures to a Terrier index is also easy. The <a href="javadoc/uk/ac/gla/terrier/structures/Index.html">Index</a> object contains methods such as <a href="javadoc/uk/ac/gla/terrier/structures/Index.html#addIndexStructure(java.lang.String,%20java.lang.String)">addIndexStructure(String, String)</a> which allow a class to be associated with a structure name (e.g. uk.ac.gla.terrier.structures.InvertedIndex is associated to the "inverted" structure. You can retrieve your structure by casting the result of <a href="javadoc/uk/ac/gla/terrier/structures/Index.html#getIndexStructure(java.lang.String)">getIndexStructure(String)</a>. For structures with more complicated constructors, other addIndexStructure methods are provided. Finally, your application can check that the desired structure exists using <a href="javadoc/uk/ac/gla/terrier/structures/Index.html#hasIndexStructure(java.lang.String)">hasIndexStructure(String)</a>.</p> 
 
<p align="justify">Terrier indices specify the random-access and in-order structure classes for each of the main structure types: direct, inverted, lexicon and document. When generating new data structures, it is good practice to provide in-order as well as random-access to your classes, should other developers wish to access these index structures at another indexing stage.</p>

[<a href="terrier_develop.html">Previous: Developing with Terrier</a>] [<a href="index.html">Contents</a>] [<a href="extend_retrieval.html">Next: Extending Retrieval</a>]
<!--!bodyend-->
<hr>
<small>
Webpage: <a href="http://ir.dcs.gla.ac.uk/terrier">http://ir.dcs.gla.ac.uk/terrier</a><br>
Contact: <a href="mailto:terrier@dcs.gla.ac.uk">terrier@dcs.gla.ac.uk</a><br>
<a href="http://www.dcs.gla.ac.uk/">Department of Computing Science</a><br>

Copyright (C) 2004-2008 <a href="http://www.gla.ac.uk/">University of Glasgow</a>. All Rights Reserved.
</small>
</body>
</html>
